
Adding website ingestion! #325

Merged: 14 commits into main, Nov 19, 2024

Conversation

@amindadgar (Member) commented Nov 17, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced CrawleeClient for enhanced web crawling capabilities.
    • Added ModulesWebsite class for managing website community data.
    • Launched WebsiteETL class for streamlined data extraction, transformation, and loading from URLs.
    • Implemented a new Airflow DAG named website_ingestion_embedding for processing website community data.
  • Bug Fixes

    • Updated return type of get_platform_metadata method for improved flexibility.
  • Chores

    • Updated dependency: tc-analyzer-lib to version 1.4.13.
    • Added new dependencies: crawlee[playwright]==0.3.8, defusedxml==0.7.1, and pydantic==2.9.2.
  • Tests

    • Introduced unit tests for the WebsiteETL class to ensure core functionalities are working as intended.

@amindadgar amindadgar linked an issue Nov 17, 2024 that may be closed by this pull request
coderabbitai bot (Contributor) commented Nov 17, 2024

Walkthrough

The changes introduce several new classes and methods for web crawling and data processing. A CrawleeClient class is created to handle web crawling via the crawlee library. A new ModulesWebsite class extends ModulesBase to manage website community data. The WebsiteETL class encapsulates the ETL process for extracting, transforming, and loading data into a vector database. Additionally, a new Airflow DAG is added for processing website community data. The requirements.txt file is updated to include the crawlee library and other dependencies.

Changes

  • dags/hivemind_etl_helpers/src/db/website/crawlee_client.py: Added CrawleeClient class with methods for crawling and handling requests.
  • dags/hivemind_etl_helpers/src/utils/modules/__init__.py: Imported ModulesWebsite from website.
  • dags/hivemind_etl_helpers/src/utils/modules/modules_base.py: Updated get_platform_metadata return type from `str | dict` to `str | dict | list`.
  • dags/hivemind_etl_helpers/src/utils/modules/website.py: Introduced ModulesWebsite class with get_learning_platforms method for community data.
  • dags/hivemind_etl_helpers/website_etl.py: Added WebsiteETL class with methods for ETL process: extract, transform, load.
  • dags/hivemind_website_ingestion.py: Added DAG: website_ingestion_embedding with tasks for community data ingestion.
  • requirements.txt: Added new dependency: crawlee[playwright]==0.3.8, updated tc-analyzer-lib version, and added defusedxml==0.7.1.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant WebsiteETL
    participant CrawleeClient
    participant CustomIngestionPipeline

    User->>WebsiteETL: extract(urls)
    WebsiteETL->>CrawleeClient: crawl(urls)
    CrawleeClient-->>WebsiteETL: crawled data
    WebsiteETL->>WebsiteETL: transform(crawled data)
    WebsiteETL->>CustomIngestionPipeline: load(transformed documents)
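A minimal end-to-end sketch of this flow, based only on the class and method names in the walkthrough above (the import path and constructor arguments are assumptions, not verified against the repository):

import asyncio

from hivemind_etl_helpers.website_etl import WebsiteETL  # assumed import path


async def ingest_website(community_id: str, urls: list[str]) -> None:
    etl = WebsiteETL(community_id)
    raw_pages = await etl.extract(urls)   # CrawleeClient crawls the URLs
    documents = etl.transform(raw_pages)  # raw pages -> llama-index Documents
    etl.load(documents)                   # documents -> vector database


# Example (hypothetical community id and URL):
# asyncio.run(ingest_website("community123", ["https://example.com"]))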

🐰 "In the meadow, changes bloom,
Crawling webs, dispelling gloom.
With each hop, new paths we find,
Data dances, intertwined.
A joyful leap, we celebrate,
The magic of our code is great!" 🌼

Possibly related PRs

  • feat: discourse fetch raw data; #201: The ExtractRawInfo class in this PR is related to the CrawleeClient class in the main PR as both involve data extraction processes, although they target different platforms (Discourse vs. Crawlee).
  • Feat/218 violation detection #221: This PR involves violation detection, which may relate to the data extraction and processing functionalities introduced in the main PR, particularly in the context of handling extracted data.
  • Update requirements.txt #289: The updates to the tc-analyzer-lib in this PR may indirectly relate to the main PR if the library is used in conjunction with the CrawleeClient for data processing or analysis.
  • feat: using endpoint for all discourse posts! #297: This PR modifies the handling of discourse posts by replacing forumUuid with an endpoint, which aligns with the main PR's focus on web crawling and data extraction from URLs.
  • Telegram Summarizer ETL Added! #314: The addition of the ExtractMessagesDaily class in this PR complements the main PR's data extraction functionalities, focusing on daily message extraction from Telegram.


@coderabbitai bot left a comment

Actionable comments posted: 14

🧹 Outside diff range and nitpick comments (13)
requirements.txt (1)

25-25: Consider documenting playwright system dependencies

The [playwright] extra requires additional system dependencies for browser automation.

Consider adding a note in the repository's README.md about the system requirements for playwright, such as:

  • Required browser dependencies
  • Minimum system requirements
  • Installation steps for different operating systems

This will help other developers set up the environment correctly.

dags/hivemind_etl_helpers/src/utils/modules/website.py (3)

5-5: Remove extra blank line.

There's an unnecessary extra blank line that can be removed.


11-30: Consider improving type hints and docstring.

While the documentation is good, there are a few improvements that could make it more maintainable and clearer:

  1. The complex return type could benefit from a type alias
  2. The docstring could include information about potential exceptions
  3. The docstring could clarify the expected format of the URLs

Consider applying these improvements:

+from typing import TypeAlias
+
+WebsiteCommunityData: TypeAlias = dict[str, str | list[str]]
+
 def get_learning_platforms(
     self,
-) -> list[dict[str, str | list[str]]]:
+) -> list[WebsiteCommunityData]:
     """
     Get all the website communities with their page titles.
+
+    Raises
+    ------
+    Exception
+        When unable to fetch platform metadata
+
+    Notes
+    -----
+    URLs should be fully qualified (including protocol) and accessible

31-33: Consider batch processing for performance optimization.

The current implementation queries platform metadata individually for each platform. Consider implementing batch processing if the underlying get_platform_metadata method supports it.

dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1)

7-61: Consider architectural improvements for production

For production-grade web crawling, consider these architectural improvements:

  1. Add request queuing system (e.g., Redis) for distributed crawling
  2. Implement retry mechanism with exponential backoff (see the sketch below)
  3. Add metrics collection (success rate, timing, etc.)
  4. Consider using proxy rotation for large-scale crawling

Would you like me to provide implementation details for any of these suggestions?
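As an illustration of point 2, a minimal retry helper with exponential backoff could look like the sketch below. It uses only the standard library, and the wrapped coroutine is hypothetical rather than part of the PR's CrawleeClient:

import asyncio
import random


async def retry_async(make_coro, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry an async operation, doubling the delay after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_coro()
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff with a little jitter to avoid synchronized retries.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)


# Usage (hypothetical): results = await retry_async(lambda: crawlee_client.crawl(urls))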

dags/hivemind_website_ingestion.py (1)

15-98: Consider architectural improvements

Both DAGs share similar structure and functionality. Consider these improvements:

  1. Create a base DAG class or factory function to reduce code duplication (see the sketch below)
  2. Implement proper error handling and retries for the ETL tasks
  3. Add monitoring and alerting for task failures

Would you like help implementing these architectural improvements?
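A rough sketch of the factory idea from point 1, using the Airflow TaskFlow API; the names (create_ingestion_dag, run_etl), schedule, and task bodies are placeholders, not the repository's actual DAG definitions:

from datetime import datetime

from airflow.decorators import dag, task


def create_ingestion_dag(dag_id: str, run_etl):
    """Build one ingestion DAG so several platforms can share the same structure."""

    @dag(dag_id=dag_id, schedule=None, start_date=datetime(2024, 11, 1), catchup=False)
    def ingestion_dag():
        @task
        def get_communities() -> list[str]:
            # Placeholder: look up the communities configured for this platform.
            return []

        @task
        def process_communities(community_ids: list[str]) -> None:
            for community_id in community_ids:
                run_etl(community_id)

        process_communities(get_communities())

    return ingestion_dag()


# Example (hypothetical): website_dag = create_ingestion_dag("website_ingestion_embedding", run_website_etl)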

🧰 Tools
🪛 Ruff

36-36: Local variable links is assigned to but never used

Remove assignment to unused variable links

(F841)


44-44: Undefined name selected_channels

(F821)


45-45: Undefined name from_date

(F821)


52-52: Undefined name start_discord_vectorstore

(F821)


70-70: Undefined name ModulesDiscord

(F821)

dags/hivemind_etl_helpers/src/utils/modules/modules_base.py (1)

Line range hint 101-115: Update docstring to match the method's current purpose

The docstring is inconsistent with the method's current functionality:

  1. The description still states "get the userid that belongs to a platform" but the method now handles generic metadata
  2. The return type in the docstring uses "Any" but the method signature specifies str | dict | list

Apply this diff to fix the docstring:

-        get the userid that belongs to a platform
+        get metadata for a platform
 
         Parameters
         -----------
         platform_id : bson.ObjectId
             the platform id we need their owner user id
         metadata_name : str
             a specific field of metadata that we want
 
         Returns
         ---------
-        metadata_value : Any
+        metadata_value : str | dict | list
             the values that the metadata belongs to
dags/hivemind_etl_helpers/website_etl.py (6)

20-20: Fix typo in docstring

There's a typo in the docstring: "ingegration" should be "integration".

Apply this diff to correct the typo:

-        notion ingegration access token
+        notion integration access token

94-94: Declare _prepare_id as a @staticmethod

Since the method _prepare_id does not use self, it should be declared as a @staticmethod to indicate that it doesn't rely on instance variables.

Apply this diff to declare the method as static:

+    @staticmethod
     def _prepare_id(data: str) -> str:

94-109: Remove unused method _prepare_id if not needed

The _prepare_id method is defined but not used anywhere in the class. If it's not required, consider removing it to keep the codebase clean.


98-107: Ensure consistent docstring formatting

The docstring sections use inconsistent underline lengths. For better readability, the number of dashes should match the length of the section titles.

Example correction:

     """
     hash the given data to prepare an id for the document

     Parameters
-    -----------
+    ----------
     data : str
         the string data to be hashed

     Returns
-    --------
+    -------
     hashed_data : str
         data hashed using sha256 algorithm
     """

86-87: Correct the type hint in the docstring

In the load method's docstring, the type hint for documents should be updated to match the method signature.

Apply this diff to correct the type hint:

     Parameters
     -------------
-    documents: list[llama_index.Document]
+    documents: list[Document]
         the llama-index documents to be ingested
     """

55-64: Ensure consistent use of type hints in docstrings

In the transform method's docstring, ensure that the type hints are consistent with the method signature.

Apply this diff to correct the return type in the docstring:

     Returns
     ---------
-    documents : list[llama_index.Document]
+    documents : list[Document]
         list of llama-index documents
     """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 8b859b8 and deb917c.

📒 Files selected for processing (7)
  • dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1 hunks)
  • dags/hivemind_etl_helpers/src/utils/modules/__init__.py (1 hunks)
  • dags/hivemind_etl_helpers/src/utils/modules/modules_base.py (2 hunks)
  • dags/hivemind_etl_helpers/src/utils/modules/website.py (1 hunks)
  • dags/hivemind_etl_helpers/website_etl.py (1 hunks)
  • dags/hivemind_website_ingestion.py (1 hunks)
  • requirements.txt (1 hunks)
🧰 Additional context used
🪛 Ruff
dags/hivemind_website_ingestion.py

36-36: Local variable links is assigned to but never used

Remove assignment to unused variable links

(F841)


44-44: Undefined name selected_channels

(F821)


45-45: Undefined name from_date

(F821)


52-52: Undefined name start_discord_vectorstore

(F821)


70-70: Undefined name ModulesDiscord

(F821)

🔇 Additional comments (5)
dags/hivemind_etl_helpers/src/utils/modules/__init__.py (1)

8-8: LGTM! The import follows the established pattern.

The new import maintains consistency with the existing import style and alphabetical ordering.

Let's verify the existence of the imported module and class:

✅ Verification successful

Import verified and implementation looks correct

The ModulesWebsite class exists in the website.py module and is properly implemented as a subclass of ModulesBase with appropriate website-specific functionality for fetching community data. The import in __init__.py is valid and consistent with the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence of website.py and ModulesWebsite class
# Expected: Find the website.py file and confirm ModulesWebsite class definition

# Check if website.py exists in the same directory
fd --type f "website.py" dags/hivemind_etl_helpers/src/utils/modules/

# Verify ModulesWebsite class definition
ast-grep --pattern 'class ModulesWebsite' dags/hivemind_etl_helpers/src/utils/modules/website.py

Length of output: 223


Script:

#!/bin/bash
# Let's verify the actual implementation of ModulesWebsite class
ast-grep --pattern 'class ModulesWebsite {
  $$$
}'

Length of output: 53


Script:

#!/bin/bash
# Let's try a different pattern for Python class definition
ast-grep --pattern 'class ModulesWebsite:
  $$$'

Length of output: 50


Script:

#!/bin/bash
# Let's check the content of the website.py file directly
cat dags/hivemind_etl_helpers/src/utils/modules/website.py

Length of output: 2067

requirements.txt (1)

25-25: Verify crawlee version and security implications

The specified version 0.3.8 is not the latest release, and being a 0.x version suggests potential API instability.

Let's verify the version and security status:

✅ Verification successful

Consider upgrading crawlee to the latest stable version

The current version 0.3.8 is outdated compared to the latest version 0.4.2b4 (released Nov 15, 2024). While no security vulnerabilities were found, using an older version might miss important fixes and improvements. However, since the latest version is still in beta (0.4.2b4), staying on 0.3.8 could be a conscious choice for stability.

  • Consider upgrading to the latest stable version in the 0.3.x series if available
  • Document the reason for pinning to 0.3.8 if this version is specifically required

Consider upgrading to crawlee 0.3.9

The pinned version 0.3.8 is not the latest stable release in the 0.3.x series. Version 0.3.9 is available and would provide the latest bug fixes while maintaining API compatibility. No security vulnerabilities were found for any version.

  • Update the requirement to crawlee[playwright]==0.3.9
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check latest version and release date
curl -s https://pypi.org/pypi/crawlee/json | jq -r '.releases | keys_unsorted[-1] as $latest | "Latest version: \($latest)\nRelease date: \(.[$latest][0].upload_time)"'

# Check for any security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "crawlee") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Length of output: 569


Script:

#!/bin/bash
# Check all available versions to find latest stable in 0.3.x series
curl -s https://pypi.org/pypi/crawlee/json | jq -r '.releases | keys[]' | grep '^0\.3\.' | sort -V

Length of output: 513

dags/hivemind_etl_helpers/src/utils/modules/website.py (1)

6-9: LGTM! Clean class structure with proper inheritance.

The class is well-structured with appropriate inheritance from ModulesBase and proper initialization sequence in the constructor.

dags/hivemind_etl_helpers/src/utils/modules/modules_base.py (1)

101-101: Verify the impact of the return type change

The return type change from str | dict to str | dict | list looks good and aligns with the website ingestion requirements. However, let's verify that all callers can handle the new list type.

✅ Verification successful

Let me analyze the usage of get_platform_metadata across the codebase to verify the impact of the return type change.

From the search results, I can see all the callers and how they handle the return value:

  1. In website.py: Used to get "resources" and stored in website_links
  2. In modules_base.py: Used to get "userId" which is then converted to ObjectId
  3. In github.py: Used to get "account" with explicit type checking for dict
  4. In mediawiki.py: Used to get "baseURL" and "path" with explicit type checking for str

Let me verify if there are any other usages with a different search pattern.


Return type change to str | dict | list is safe to proceed

All callers of get_platform_metadata are properly handling the return types:

  • website.py: Accepts list type for "resources" metadata (the reason for this change)
  • github.py: Has explicit type checking for dict type with isinstance(account, dict)
  • mediawiki.py: Has explicit type checking for str type with isinstance(path, str) and isinstance(base_url, str)
  • modules_base.py: Uses it for "userId" which is always expected to be str (converted to ObjectId)

No other usages were found that could be impacted by the addition of the list type.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find all calls to get_platform_metadata to verify they handle the new list return type

# Search for direct method calls
rg -A 5 "get_platform_metadata\(" 

# Search for variable assignments that might need type checking
ast-grep --pattern 'metadata = $_.get_platform_metadata($$$)'

Length of output: 3453


Script:

#!/bin/bash
# Search for any other variable assignments or usages of get_platform_metadata
rg -A 2 "= .*get_platform_metadata\(" 
rg -A 2 "get_platform_metadata\(.*\)\." 

Length of output: 1728

dags/hivemind_etl_helpers/website_etl.py (1)

31-50: Verify that crawlee_client.crawl is asynchronous

Ensure that the crawl method in CrawleeClient is an asynchronous function since it's being awaited in the extract method.

Run the following script to confirm the crawl method is asynchronous:

✅ Verification successful

Let's search for the implementation of the CrawleeClient class to verify if the crawl method is asynchronous. If we don't find it directly, we can try other search patterns.


crawl method is correctly implemented as asynchronous

The crawl method in CrawleeClient is properly defined as an async function and uses await statements for asynchronous operations with the crawler, which matches the usage in the extract method.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if `crawl` method in `CrawleeClient` is asynchronous.

# Search for the definition of `crawl` method
fd --type f --name '*.py' | xargs grep -E 'async def crawl\('

Length of output: 314


Script:

#!/bin/bash
# Let's try a different approach using ripgrep to find the CrawleeClient class and its crawl method
rg -A 5 "class CrawleeClient" 

Length of output: 584


Script:

#!/bin/bash
# Let's examine the full content of the crawlee_client.py file to find the crawl method
cat dags/hivemind_etl_helpers/src/db/website/crawlee_client.py

Length of output: 2120

Resolved review threads (outdated):
  • dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (3 comments)
  • dags/hivemind_website_ingestion.py (3 comments)
  • dags/hivemind_etl_helpers/website_etl.py (2 comments)
@coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (6)
dags/hivemind_etl_helpers/tests/unit/test_website_etl.py (6)

10-18: Consider removing environment variable loading from unit tests

Loading environment variables in unit tests can lead to non-deterministic behavior and make tests dependent on the environment. Consider mocking any required environment variables directly in the test setup for better isolation and reproducibility.

 def setUp(self):
     """
     Setup for the test cases. Initializes a WebsiteETL instance with mocked dependencies.
     """
-    load_dotenv()
     self.community_id = "test_community"
     self.website_etl = WebsiteETL(self.community_id)
     self.website_etl.crawlee_client = AsyncMock()
     self.website_etl.ingestion_pipeline = MagicMock()

20-38: Add error handling test cases

The test covers the happy path but should also verify error handling scenarios. Consider adding test cases for:

  • Empty URL list
  • Failed crawl attempts
  • Invalid URLs

Example addition:

async def test_extract_empty_urls(self):
    """Test extract method with empty URL list."""
    urls = []
    extracted_data = await self.website_etl.extract(urls)
    self.assertEqual(extracted_data, [])
    self.website_etl.crawlee_client.crawl.assert_not_awaited()

async def test_extract_crawl_failure(self):
    """Test extract method when crawl fails."""
    urls = ["https://example.com"]
    self.website_etl.crawlee_client.crawl.side_effect = Exception("Crawl failed")
    
    with self.assertRaises(Exception):
        await self.website_etl.extract(urls)

39-64: Add edge case tests for transform method

The current test covers the basic transformation, but should also verify:

  • Empty raw data
  • Missing fields in the input data
  • Special characters in text/title

Example addition:

def test_transform_empty_data(self):
    """Test transform method with empty data."""
    self.assertEqual(self.website_etl.transform([]), [])

def test_transform_missing_fields(self):
    """Test transform method with missing fields."""
    raw_data = [{"url": "https://example.com"}]  # missing inner_text and title
    documents = self.website_etl.transform(raw_data)
    self.assertEqual(documents[0].text, "")  # verify default handling

65-82: Enhance load method test coverage

Consider adding test cases for:

  • Multiple documents
  • Empty document list
  • Pipeline failures

Example addition:

def test_load_multiple_documents(self):
    """Test load method with multiple documents."""
    documents = [
        Document(doc_id="doc1", text="text1"),
        Document(doc_id="doc2", text="text2")
    ]
    self.website_etl.load(documents)
    self.website_etl.ingestion_pipeline.run_pipeline.assert_called_once_with(
        docs=documents
    )

def test_load_pipeline_failure(self):
    """Test load method when pipeline fails."""
    self.website_etl.ingestion_pipeline.run_pipeline.side_effect = Exception("Pipeline failed")
    with self.assertRaises(Exception):
        self.website_etl.load([Document(doc_id="doc1", text="text1")])

83-92: Add input validation tests for _prepare_id method

The current test only verifies basic string hashing. Consider adding test cases for:

  • Empty strings
  • Non-string inputs
  • Unicode characters
  • Very long inputs

Example addition:

def test_prepare_id_edge_cases(self):
    """Test _prepare_id method with various edge cases."""
    # Empty string
    self.assertTrue(WebsiteETL._prepare_id(""))
    
    # Unicode
    unicode_text = "Hello 世界"
    self.assertTrue(WebsiteETL._prepare_id(unicode_text))
    
    # Long string
    long_text = "a" * 10000
    self.assertTrue(WebsiteETL._prepare_id(long_text))
    
    # Non-string input
    with self.assertRaises(AttributeError):
        WebsiteETL._prepare_id(None)

1-92: Consider improving test organization and documentation

While the test coverage is good for basic scenarios, consider these improvements:

  1. Group related test cases using nested TestCase classes
  2. Add docstring examples showing input/output for complex scenarios
  3. Consider using parameterized tests for testing multiple similar cases
  4. Add integration tests for the complete ETL pipeline

Example test organization:

class TestWebsiteETL(IsolatedAsyncioTestCase):
    class TestExtract(IsolatedAsyncioTestCase):
        # Extract-related tests
        
    class TestTransform(IsolatedAsyncioTestCase):
        # Transform-related tests
        
    class TestLoad(IsolatedAsyncioTestCase):
        # Load-related tests
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 0c314d9 and 1b90754.

📒 Files selected for processing (3)
  • dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/unit/test_website_etl.py (1 hunks)
  • dags/hivemind_etl_helpers/website_etl.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • dags/hivemind_etl_helpers/website_etl.py
🧰 Additional context used
🪛 Ruff
dags/hivemind_etl_helpers/src/db/website/crawlee_client.py

74-74: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


116-116: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

@coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (10)
dags/hivemind_etl_helpers/website_etl.py (4)

3-3: Remove unused import

The hashlib module is imported but never used in the code.

Apply this diff to remove the unused import:

-import hashlib
🧰 Tools
🪛 Ruff

3-3: hashlib imported but unused

Remove unused import: hashlib

(F401)


14-19: Enhance class and method documentation

The docstring could be more comprehensive by including:

  • Class purpose and responsibilities
  • Description of the collection_name parameter
  • Examples of usage

Consider expanding the docstring:

 """
+WebsiteETL handles the Extract, Transform, Load pipeline for website data.
+
 Parameters
 -----------
 community_id : str
     the community to save its data
+collection_name : str
+    name of the collection in the vector database (defaults to "website")
+
+Example
+-------
+>>> etl = WebsiteETL(community_id="community123")
+>>> documents = await etl.extract(urls=["https://example.com"])
+>>> transformed = etl.transform(documents)
+>>> etl.load(transformed)
 """

29-49: Add URL validation

Consider validating the input URLs before crawling to ensure they are well-formed and accessible.

Consider adding validation:

 async def extract(
     self,
     urls: list[str],
 ) -> list[dict[str, Any]]:
+    # Validate URLs
+    from urllib.parse import urlparse
+    for url in urls:
+        result = urlparse(url)
+        if not all([result.scheme, result.netloc]):
+            raise ValueError(f"Invalid URL format: {url}")
+
     extracted_data = await self.crawlee_client.crawl(urls)
     return extracted_data

67-67: Consider using a more robust document ID strategy

Using the URL directly as the document ID might lead to collisions if the same URL is processed multiple times. Consider using a combination of URL and timestamp or a hash of the content.

Consider implementing a more robust ID strategy:

-doc_id = data["url"]
+from datetime import datetime
+doc_id = f"{data['url']}_{datetime.utcnow().isoformat()}"
dags/hivemind_etl_helpers/tests/unit/test_website_etl.py (6)

6-6: Remove unused import

The hashlib module is imported but never used in the code.

-from llama_index.core import Document
-import hashlib
+from llama_index.core import Document
🧰 Tools
🪛 Ruff

6-6: hashlib imported but unused

Remove unused import: hashlib

(F401)


11-13: Enhance setUp docstring

The docstring could be more descriptive about the mocked components and their purposes.

-        """
-        Setup for the test cases. Initializes a WebsiteETL instance with mocked dependencies.
-        """
+        """
+        Setup for the test cases.
+        
+        Initializes a WebsiteETL instance with:
+        - A mocked crawlee_client for simulating web crawling
+        - A mocked ingestion_pipeline for testing data loading
+        """

20-38: Enhance test coverage for extract method

While the happy path is well tested, consider adding these test cases:

  1. Empty URL list
  2. Failed crawl scenario
  3. Malformed response data

Example implementation:

async def test_extract_empty_urls(self):
    """Test extract method with empty URL list."""
    urls = []
    extracted_data = await self.website_etl.extract(urls)
    self.assertEqual(extracted_data, [])
    self.website_etl.crawlee_client.crawl.assert_awaited_once_with(urls)

async def test_extract_crawl_failure(self):
    """Test extract method when crawl fails."""
    urls = ["https://example.com"]
    self.website_etl.crawlee_client.crawl.side_effect = Exception("Crawl failed")
    with self.assertRaises(Exception):
        await self.website_etl.extract(urls)

39-64: Add edge cases to transform method tests

Consider adding test cases for:

  1. Empty input data
  2. Missing optional fields
  3. Special characters in text/title

Example implementation:

def test_transform_edge_cases(self):
    """Test transform method with edge cases."""
    # Test empty input
    self.assertEqual(self.website_etl.transform([]), [])
    
    # Test missing optional fields
    raw_data = [{"url": "https://example.com", "inner_text": "text"}]  # missing title
    documents = self.website_etl.transform(raw_data)
    self.assertEqual(documents[0].metadata.get("title"), None)
    
    # Test special characters
    raw_data = [{
        "url": "https://example.com",
        "inner_text": "Text with 特殊字符",
        "title": "Title with 🚀"
    }]
    documents = self.website_etl.transform(raw_data)
    self.assertEqual(documents[0].text, "Text with 特殊字符")

65-81: Enhance load method test coverage

The current test covers the basic case, but consider adding:

  1. Error handling test for pipeline failures
  2. Test with empty document list
  3. Test with multiple documents

Example implementation:

def test_load_pipeline_failure(self):
    """Test load method when pipeline fails."""
    documents = [Document(doc_id="test", text="text")]
    self.website_etl.ingestion_pipeline.run_pipeline.side_effect = Exception("Pipeline failed")
    with self.assertRaises(Exception):
        self.website_etl.load(documents)

def test_load_empty_documents(self):
    """Test load method with empty document list."""
    self.website_etl.load([])
    self.website_etl.ingestion_pipeline.run_pipeline.assert_called_once_with(docs=[])

def test_load_multiple_documents(self):
    """Test load method with multiple documents."""
    documents = [
        Document(doc_id="1", text="text1"),
        Document(doc_id="2", text="text2")
    ]
    self.website_etl.load(documents)
    self.website_etl.ingestion_pipeline.run_pipeline.assert_called_once_with(docs=documents)

1-81: Consider implementing a test data factory

To improve test maintainability and reduce duplication, consider implementing a test data factory class that generates test documents and raw data. This would make the tests more maintainable and ensure consistency across test cases.

Example implementation:

class WebsiteTestDataFactory:
    @staticmethod
    def create_raw_data(url="https://example.com", text="Example text", title="Example"):
        return {
            "url": url,
            "inner_text": text,
            "title": title
        }
    
    @staticmethod
    def create_document(doc_id="https://example.com", text="Example text", title="Example"):
        return Document(
            doc_id=doc_id,
            text=text,
            metadata={"title": title, "url": doc_id}
        )
🧰 Tools
🪛 Ruff

6-6: hashlib imported but unused

Remove unused import: hashlib

(F401)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ad7d945 and 6a8f6a5.

📒 Files selected for processing (3)
  • dags/hivemind_etl_helpers/tests/unit/test_website_etl.py (1 hunks)
  • dags/hivemind_etl_helpers/website_etl.py (1 hunks)
  • dags/hivemind_website_ingestion.py (1 hunks)
🧰 Additional context used
🪛 Ruff
dags/hivemind_etl_helpers/tests/unit/test_website_etl.py

6-6: hashlib imported but unused

Remove unused import: hashlib

(F401)

dags/hivemind_etl_helpers/website_etl.py

3-3: hashlib imported but unused

Remove unused import: hashlib

(F401)

🔇 Additional comments (1)
dags/hivemind_website_ingestion.py (1)

50-51: LGTM! DAG structure is correct.

The use of task.expand for parallel processing of communities is appropriate and the task dependencies are correctly implied.
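For reference, a minimal sketch of the dynamic task mapping pattern this refers to (illustrative DAG and task names only, not the DAG in this PR):

from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 11, 1), schedule=None, catchup=False)
def website_ingestion_sketch():
    @task
    def get_website_communities() -> list[dict]:
        # Placeholder: one dict per community to ingest.
        return [{"community_id": "c1", "urls": ["https://example.com"]}]

    @task
    def process_community(community: dict) -> None:
        # .expand() below creates one mapped task instance per community,
        # so communities are processed in parallel.
        ...

    process_community.expand(community=get_website_communities())


website_ingestion_sketch()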

Resolved review threads:
  • dags/hivemind_website_ingestion.py (4 comments, 2 outdated)
  • dags/hivemind_etl_helpers/website_etl.py (1 comment)
@coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (6)
dags/hivemind_etl_helpers/website_etl.py (2)

9-27: Consider making collection_name configurable

The collection name "website" is hardcoded. Consider making it configurable through the constructor to improve flexibility and reusability.

     def __init__(
         self,
         community_id: str,
+        collection_name: str = "website",
     ) -> None:
         """
         Parameters
         -----------
         community_id : str
             the community to save its data
+        collection_name : str, optional
+            the name of the collection (default: "website")
         """
         self.community_id = community_id
-        collection_name = "website"

         # preparing the data extractor and ingestion pipelines
         self.crawlee_client = CrawleeClient()
         self.ingestion_pipeline = CustomIngestionPipeline(
             self.community_id, collection_name=collection_name
         )

49-78: Add validation for created documents

Consider validating the created documents to ensure they contain meaningful content before returning them.

     def transform(self, raw_data: list[dict[str, Any]]) -> list[Document]:
         """
         transform raw data to llama-index documents
         ...
         """
         documents: list[Document] = []
 
         for data in raw_data:
             doc_id = data["url"]
             doc = Document(
                 doc_id=doc_id,
                 text=data["inner_text"],
                 metadata={
                     "title": data["title"],
                     "url": data["url"],
                 },
             )
+            # Validate document content
+            if not doc.text.strip():
+                continue
             documents.append(doc)
 
+        if not documents:
+            raise ValueError("No valid documents created from the raw data")
         return documents
dags/hivemind_etl_helpers/tests/unit/test_website_etl.py (4)

10-18: Consider mocking environment variables instead of loading them in setUp

Loading real environment variables in setUp could lead to test isolation issues and make tests dependent on the environment. Consider mocking environment variables or setting up test-specific values.

 def setUp(self):
     """
     Setup for the test cases. Initializes a WebsiteETL instance with mocked dependencies.
     """
-    load_dotenv()
     self.community_id = "test_community"
     self.website_etl = WebsiteETL(self.community_id)
     self.website_etl.crawlee_client = AsyncMock()
     self.website_etl.ingestion_pipeline = MagicMock()

20-38: Enhance test coverage with error cases and data validation

While the happy path is well tested, consider adding:

  1. Error cases (e.g., network failures, invalid URLs)
  2. Edge cases (e.g., empty URL list, malformed responses)
  3. Schema validation for the mocked_data structure

Example additional test:

async def test_extract_with_network_error(self):
    """Test the extract method when crawling fails."""
    urls = ["https://example.com"]
    self.website_etl.crawlee_client.crawl.side_effect = Exception("Network error")
    
    with self.assertRaises(Exception):
        await self.website_etl.extract(urls)

39-64: Add test cases for edge cases in transform method

Consider adding tests for:

  1. Empty or missing fields in raw_data
  2. Special characters in text/title
  3. Multiple documents transformation
  4. Invalid URLs

Example additional test case:

def test_transform_with_missing_fields(self):
    """Test transform with incomplete data."""
    raw_data = [
        {
            "url": "https://example.com",
            # missing inner_text and title
        }
    ]
    documents = self.website_etl.transform(raw_data)
    self.assertEqual(documents[0].text, "")  # or expected fallback

65-81: Enhance load testing with error handling and scalability tests

Consider adding tests for:

  1. Pipeline errors or exceptions
  2. Large document sets to verify performance
  3. Document validation failures

Example additional test:

def test_load_with_pipeline_error(self):
    """Test load method when pipeline fails."""
    documents = [Document(doc_id="test", text="text")]
    self.website_etl.ingestion_pipeline.run_pipeline.side_effect = Exception("Pipeline error")
    
    with self.assertRaises(Exception):
        self.website_etl.load(documents)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 6a8f6a5 and 9a2bbbb.

📒 Files selected for processing (4)
  • dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/unit/test_website_etl.py (1 hunks)
  • dags/hivemind_etl_helpers/website_etl.py (1 hunks)
  • dags/hivemind_website_ingestion.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • dags/hivemind_website_ingestion.py
🧰 Additional context used
🪛 Ruff
dags/hivemind_etl_helpers/src/db/website/crawlee_client.py

75-75: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


117-117: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (3)
dags/hivemind_etl_helpers/website_etl.py (2)

1-8: LGTM! Well-structured ETL implementation

The imports are appropriate and the class structure follows the ETL pattern with clear separation of concerns.


79-89: Past review comment about error handling is still applicable

The previous review comment suggesting error handling for the pipeline execution is still valid and should be addressed.

dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1)

1-14: LGTM! Clean imports and class definition.

The imports are well-organized, and the class is properly defined with type hints. Good use of defusedxml for secure XML parsing.

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (7)
dags/hivemind_etl_helpers/website_etl.py (5)

9-27: Add input validation for community_id

Consider validating the community_id parameter to ensure it's not empty or malformed.

 def __init__(
     self,
     community_id: str,
 ) -> None:
+    if not community_id or not isinstance(community_id, str):
+        raise ValueError("community_id must be a non-empty string")
     self.community_id = community_id
     collection_name = "website"

28-53: Add URL format validation

Consider validating the URL format before crawling to fail fast on malformed URLs.

+from urllib.parse import urlparse
+
 async def extract(
     self,
     urls: list[str],
 ) -> list[dict[str, Any]]:
     if not urls:
         raise ValueError("No URLs provided for crawling")
+    
+    for url in urls:
+        try:
+            result = urlparse(url)
+            if not all([result.scheme, result.netloc]):
+                raise ValueError(f"Invalid URL format: {url}")
+        except Exception as e:
+            raise ValueError(f"Invalid URL: {url}. Error: {str(e)}")
+
     extracted_data = await self.crawlee_client.crawl(urls)

54-83: Add input validation for raw_data structure

Consider validating the structure of raw_data entries before processing to ensure required fields are present.

 def transform(self, raw_data: list[dict[str, Any]]) -> list[Document]:
+    required_fields = {"url", "inner_text", "title"}
+    for data in raw_data:
+        missing_fields = required_fields - set(data.keys())
+        if missing_fields:
+            raise ValueError(
+                f"Missing required fields {missing_fields} in data: {data}"
+            )
+
     documents: list[Document] = []
     for data in raw_data:

84-94: Add input validation for documents

Consider validating the documents list to ensure it's not empty before attempting to load.

 def load(self, documents: list[Document]) -> None:
+    if not documents:
+        raise ValueError("No documents provided for loading")
+
     # loading data into db
     self.ingestion_pipeline.run_pipeline(docs=documents)

28-53: Consider implementing rate limiting for web crawling

The extract method could benefit from rate limiting to avoid overwhelming target websites. Consider implementing rate limiting in the CrawleeClient or at this layer.
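A minimal sketch of one way to add client-side rate limiting at this layer, using an asyncio semaphore plus a fixed delay between requests; the class name and the fetch coroutine are illustrative and not part of the PR's CrawleeClient:

import asyncio


class RateLimiter:
    """Cap concurrency and space out requests to avoid overwhelming target sites."""

    def __init__(self, max_concurrency: int = 2, min_interval: float = 1.0):
        self._semaphore = asyncio.Semaphore(max_concurrency)
        self._min_interval = min_interval

    async def run(self, fetch_coro):
        async with self._semaphore:
            result = await fetch_coro
            await asyncio.sleep(self._min_interval)  # polite gap between requests
            return result


# Usage (hypothetical, with a crawl_one(url) coroutine):
# limiter = RateLimiter()
# pages = await asyncio.gather(*(limiter.run(crawl_one(url)) for url in urls))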

dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (2)

21-23: Document configuration choices.

Consider adding a docstring or comment explaining why persistence and metadata writing are disabled. This will help future maintainers understand the rationale behind these configuration choices.


25-44: Add URL validation in the request handler.

Consider validating the URL format before processing to prevent potential issues with malformed URLs. This is especially important since the handler is processing both regular pages and sitemaps.

         async def request_handler(context: PlaywrightCrawlingContext) -> None:
             context.log.info(f"Processing {context.request.url} ...")
+            if not context.request.url or not context.request.url.startswith(("http://", "https://")):
+                context.log.warning(f"Skipping invalid URL: {context.request.url}")
+                return

             inner_text = await context.page.inner_text(selector="body")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7710f05 and 6542ae5.

📒 Files selected for processing (3)
  • dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1 hunks)
  • dags/hivemind_etl_helpers/website_etl.py (1 hunks)
  • dags/hivemind_website_ingestion.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • dags/hivemind_website_ingestion.py
🧰 Additional context used
📓 Learnings (1)
dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1)
Learnt from: amindadgar
PR: TogetherCrew/airflow-dags#325
File: dags/hivemind_etl_helpers/src/db/website/crawlee_client.py:15-24
Timestamp: 2024-11-19T06:57:30.667Z
Learning: In this project, the `PlaywrightCrawler` does not support the attributes `max_requests_per_second` and `respect_robots_txt`.
🪛 Ruff
dags/hivemind_etl_helpers/src/db/website/crawlee_client.py

74-74: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


116-116: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (4)
dags/hivemind_etl_helpers/website_etl.py (1)

1-8: LGTM: Clean imports and class definition

The imports are appropriate and the class is well-defined following Python conventions.

dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (3)

1-14: LGTM: Clean imports and well-structured class definition.

The imports are appropriate, and type hints are used correctly. The class constructor parameters are well-documented with sensible defaults.


115-116: 🛠️ Refactor suggestion

Add exception chaining for timeout error.

Preserve the exception context by using raise ... from e.

         except asyncio.TimeoutError as e:
-            raise TimeoutError("Crawl operation timed out")
+            raise TimeoutError("Crawl operation timed out") from e

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff

116-116: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


73-74: 🛠️ Refactor suggestion

Add exception chaining for better error tracking.

When re-raising exceptions, use raise ... from e to preserve the original exception context.

         except ET.ParseError as e:
-            raise ValueError(f"Invalid sitemap XML: {str(e)}")
+            raise ValueError(f"Invalid sitemap XML: {str(e)}") from e

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff

74-74: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

@amindadgar amindadgar merged commit 023ec51 into main Nov 19, 2024
14 checks passed
Linked issue that may be closed by this pull request: feat: Ingestion Website Extractor